NSF PAR Search | NSF Public Access Repository

Multi-Agent Reinforcement Learning with Serverless Computing

https://doi.org/10.1145/3772052.3772227

Wei, Rui; Yu, Hanfei; Song, Xikang; Li, Jian; Tiwari, Devesh; Mao, Ying; Wang, Hao (November 2025, ACM Symposium on Cloud Computing 2025)

Full Text Available
Nitro: Boosting Distributed Reinforcement Learning with Serverless Computing

https://doi.org/10.14778/3696435.3696441

Yu, Hanfei; Carter, Jacob; Wang, Hao; Tiwari, Devesh; Li, Jian; Park, Seung-Jong (September 2025, Proceedings of the VLDB Endowment)

Deep reinforcement learning (DRL) has demonstrated significant potential in various applications, including gaming AI, robotics, and system scheduling. DRL algorithms produce, sample, and learn from training data online through a trial-and-error process, demanding considerable time and computational resources. To address this, distributed DRL algorithms and paradigms have been developed to expedite training using extensive resources. Through carefully designed experiments, we are the first to observe that strategically increasing the actor-environment interactions by spawning more concurrent actors at certain training rounds within ephemeral time frames can significantly enhance training efficiency. Yet, current distributed DRL solutions, which are predominantly server-based (or serverful), fail to capitalize on these opportunities due to their long startup times, limited adaptability, and cumbersome scalability. This paper proposes Nitro, a generic training engine for distributed DRL algorithms that enforces timely and effective boosting with concurrent actors instantaneously spawned by serverless computing. With serverless functions, Nitro adjusts data sampling strategies dynamically according to the DRL training demands. Nitro seizes the opportunity of real-time boosting by accurately and swiftly detecting an empirical metric. To achieve cost efficiency, we design a heuristic actor scaling algorithm to guide Nitro for cost-aware boosting budget allocation. We integrate Nitro with state-of-the-art DRL algorithms and frameworks and evaluate them on AWS EC2 and Lambda. Experiments with Mujoco and Atari benchmarks show that Nitro improves the final rewards (i.e., training quality) by up to 6× and reduces training costs by up to 42%.
Full Text Available
Pre-Warming is Not Enough: Accelerating Serverless Inference With Opportunistic Pre-Loading

https://doi.org/10.1145/3698038.3698509

Sui, Yifan; Yu, Hanfei; Hu, Yitao; Li, Jianxun; Wang, Hao (November 2024, ACM)

Full Text Available
Stellaris: Staleness-Aware Distributed Reinforcement Learning with Serverless Computing

https://doi.org/10.1109/SC41406.2024.00045

Yu, Hanfei; Wang, Hao; Tiwari, Devesh; Li, Jian; Park, Seung-Jong (November 2024, IEEE)

Full Text Available
Freyr+: Harvesting Idle Resources in Serverless Computing via Deep Reinforcement Learning

https://doi.org/10.1109/TPDS.2024.3462294

Yu, Hanfei; Wang, Hao; Li, Jian; Yuan, Xu; Park, Seung-Jong (November 2024, IEEE Transactions on Parallel and Distributed Systems)

Full Text Available
Accelerating ML Inference via Opportunistic Pre-Loading on Serverless Clusters

https://doi.org/10.1109/TPDS.2025.3638428

Sui, Yifan; Yu, Hanfei; Hu, Yitao; Li, Jianxun; Wang, Hao (February 2026, IEEE Transactions on Parallel and Distributed Systems)

Full Text Available
Cheaper and Faster: Distributed Deep Reinforcement Learning with Serverless Computing

https://doi.org/10.1609/aaai.v38i15.29592

Yu, Hanfei; Li, Jian; Hua, Yang; Yuan, Xu; Wang, Hao (March 2024, Proceedings of the AAAI Conference on Artificial Intelligence)

Deep reinforcement learning (DRL) has gained immense success in many applications, including gaming AI, robotics, and system scheduling. Many distributed algorithms and architectures (e.g., the actor-learner architecture) have been proposed to accelerate DRL training with large-scale server-based clusters. However, training on-policy algorithms with the actor-learner architecture unavoidably wastes resources due to synchronization between learners and actors, resulting in significant extra billing. As a promising alternative, serverless computing naturally fits on-policy synchronization and alleviates resource wasting in distributed DRL training with pay-as-you-go pricing. Yet, no prior work has leveraged serverless computing to facilitate DRL training. This paper proposes MinionsRL, the first serverless distributed DRL training framework, which aims to improve DRL training efficiency and cost efficiency with dynamic actor scaling. We prototype MinionsRL on top of Microsoft Azure Container Instances and evaluate it with popular DRL tasks from OpenAI Gym. Extensive experiments show that MinionsRL reduces total training time by up to 52% and training cost by 86% compared to the latest solutions.
Full Text Available
RainbowCake: Mitigating Cold-starts in Serverless with Layer-wise Container Caching and Sharing

https://doi.org/10.1145/3617232.3624871

Yu, Hanfei; Basu Roy, Rohan; Fontenot, Christian; Tiwari, Devesh; Li, Jian; Zhang, Hong; Wang, Hao; Park, Seung-Jong (April 2024, ACM)

Full Text Available
Libra: Harvesting Idle Resources Safely and Timely in Serverless Clusters

https://doi.org/10.1145/3588195.3592996

Yu, Hanfei; Fontenot, Christian; Wang, Hao; Li, Jian; Yuan, Xu; Park, Seung-Jong (August 2023, ACM)

Serverless computing has been favored by users and infrastructure providers from various industries, including online services and scientific computing. Users enjoy its auto-scaling and ease of management, and providers gain more control to optimize their service. However, existing serverless platforms still require users to pre-define resource allocations for their functions, leading to frequent misconfiguration by inexperienced users in practice. Besides, functions' varying input data further widens the gap between their dynamic resource demands and static allocations, leaving functions either over-provisioned or under-provisioned. This paper presents Libra, a safe and timely resource harvesting framework for multi-node serverless clusters. Libra makes precise harvesting decisions to accelerate function invocations with harvested resources and jointly improve resource utilization by profiling dynamic resource demands and availability proactively. Experiments on OpenWhisk clusters with real-world workloads show that Libra reduces response latency by 39% and achieves 3× resource utilization compared to state-of-the-art solutions.
Full Text Available
Accelerating Serverless Computing by Harvesting Idle Resources

https://doi.org/10.1145/3485447.3511979

Yu, Hanfei; Wang, Hao; Li, Jian; Yuan, Xu; Park, Seung-Jong (April 2022, Proceedings of the ACM Web Conference 2022)

Serverless computing automates fine-grained resource scaling and simplifies the development and deployment of online services with stateless functions. However, it is still non-trivial for users to allocate appropriate resources due to various function types, dependencies, and input sizes. Misconfiguration of resource allocations leaves functions either under-provisioned or over-provisioned and leads to continuous low resource utilization. This paper presents Freyr, a new resource manager (RM) for serverless platforms that maximizes resource efficiency by dynamically harvesting idle resources from over-provisioned functions to under-provisioned functions. Freyr monitors each function’s resource utilization in real time, detects over-provisioning and under-provisioning, and learns to harvest idle resources safely and accelerate functions efficiently by applying deep reinforcement learning algorithms along with a safeguard mechanism. We have implemented and deployed a Freyr prototype in a 13-node Apache OpenWhisk cluster. Experimental results show that 38.8% of function invocations have idle resources harvested by Freyr, and 39.2% of invocations are accelerated by the harvested resources. Freyr reduces the 99th-percentile function response latency by 32.1% compared to the baseline RMs.
Full Text Available
